
[import] Separate dataset import logic into separate modules #171

Merged · 11 commits · Sep 14, 2017

Conversation

@rexxars (Member) commented Sep 13, 2017

This PR removes the dataset import logic from @sanity/core and moves it into a separate module called @sanity/import. This makes it much easier to maintain and write tests for, and makes it much more reusable.

In the process, I've rewritten most of the logic to make it much easier to work with and debug, with the tradeoff that it will use more memory and be slightly slower.

I've also made a separate, installable CLI tool, @sanity/import-cli, which handles only the dataset import logic.

As part of the rewrite, a few major improvements have been made:

  • Assets are now hashed by content, not URL. This makes it much more reliable to determine whether an asset needs to be uploaded.
  • Objects within arrays now have keys added to them if not already set, to improve realtime behaviour.
  • References are now assigned the _type field if not already set.
  • "Import maps" are no longer generated as documents, but kept in memory. Less clutter, at the expense of more memory usage.
  • Batching is now much smarter. Batch sizes are determined by JSON payload size instead of number of documents, which should make huge payloads much less likely (see the sketch after this list).
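
To illustrate the batching point, here is a minimal sketch of what size-based batching could look like. The function shape and the 1 MB threshold are assumptions for the example, not the actual @sanity/import implementation:

```js
// Sketch: split documents into batches whose serialized JSON stays below a
// byte threshold, instead of batching by a fixed document count.
const MAX_BATCH_BYTES = 1024 * 1024 // assumed limit, for illustration only

function batchDocuments(documents) {
  const batches = []
  let current = []
  let currentSize = 0

  for (const doc of documents) {
    const size = Buffer.byteLength(JSON.stringify(doc), 'utf8')

    // Start a new batch if adding this document would push us over the limit
    if (current.length > 0 && currentSize + size > MAX_BATCH_BYTES) {
      batches.push(current)
      current = []
      currentSize = 0
    }

    current.push(doc)
    currentSize += size
  }

  if (current.length > 0) {
    batches.push(current)
  }

  return batches
}
```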

There are a couple more things that we should do to improve the import at some point, which I've outlined in the readme, but for the most part this should now work better and more reliably than the old import.

Sorry about the humongous size of the PR.

@bjoerge (Member) left a comment

Just had a minor question, otherwise it looks really great!

// Create batches of documents to import. Try to keep batches below a certain
// byte size (since documents may vary greatly in size depending on type etc.)
const batches = batchDocuments(weakened)

@bjoerge (Member)

It's not clear to me, so just checking: importBatches, uploadAssets and strengthenReferences must be done in sequence, right?

@rexxars (Member, Author)

Sort of. importBatches must be done first, but uploadAssets and strengthenReferences could be done in parallel. The reason we're not doing this at the moment is that if one of these operations fails, there is no way to stop the other. I've made a note of it in the readme as a definite improvement for the future, but it would require using something like a queue instead of simply mapping over items with a concurrency limit.
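
For clarity, here is a rough sketch of the sequencing being described, using the step names from this thread; the signatures and stubs are assumptions for illustration, not the module's actual API:

```js
// Hypothetical stubs standing in for the real implementations
async function importBatches(client, batches) {/* create/patch documents batch by batch */}
async function uploadAssets(client, documents) {/* upload the assets the documents refer to */}
async function strengthenReferences(client, documents) {/* patch weak references to strong */}

async function runImport(client, batches, documents) {
  // Must complete before the other two steps can run
  await importBatches(client, batches)

  // These two could in principle run in parallel:
  //   await Promise.all([uploadAssets(client, documents), strengthenReferences(client, documents)])
  // ...but they are kept sequential for now, since a failure in one
  // currently has no way to cancel the other.
  await uploadAssets(client, documents)
  await strengthenReferences(client, documents)
}
```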

@bjoerge (Member) commented Sep 13, 2017

I see, thanks for the clarification! And +1 for improved error handling in the future. If strengthenReferences fails right now, will it result in dangling assets too, or are the asset uploads idempotent?

@rexxars (Member, Author)

Asset uploads are done first and connected to their corresponding documents. If this operation fails, it will leave dangling assets. When re-running the import, however, it should find the same assets already uploaded and reuse them.

Failures when strengthening references will also leave some references weak, which can easily happen if you are referring to documents that do not exist (incorrect IDs or similar). I've also noted in the readme that we should attempt to check whether referenced documents actually exist before starting the upload.
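
For context, a reference imported as weak versus its strengthened form looks roughly like this; the document ID is made up, and exactly how the patching is done is left to the import module:

```js
// Illustration of the weak/strong distinction being discussed.
// A weak reference does not enforce referential integrity, so the target
// document is allowed to not exist yet (or at all):
const weakRef = {_type: 'reference', _ref: 'some-document-id', _weak: true}

// After strengthening, the _weak flag is dropped and the reference is enforced:
const strongRef = {_type: 'reference', _ref: 'some-document-id'}
```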

All in all, "rolling back" a failed upload is pretty hard.

@bjoerge (Member)

> When re-running the import, however, it should find the same assets already uploaded and reuse them.

This is great. I have no further comments 👩‍⚖️

@rexxars (Member, Author)

> This is great. I have no further comments 👩‍⚖️

Achievement unlocked.
